Neuro-semantic prediction of user decisions to contribute content to online social networks

Understanding the generation of content in an online social network (OSN) at the microscopic level is highly desirable for improved management of the OSN and the prevention of undesirable phenomena, such as online harassment. Content generation, i.e., the decision to post a contribution in the OSN, can be modeled by neurophysiological approaches on the basis of unbiased semantic analysis of the content already published in the OSN. This paper proposes a neuro-semantic model composed of (1) an extended leaky competing accumulator (ELCA) as the neural architecture implementing the user's concurrent decision process to generate content in a conversation thread of a virtual community of practice, and (2) a semantic model based on topic analysis carried out by latent Dirichlet allocation (LDA) of both users and conversation threads. We use the similarity between the user and thread semantic representations to build a model of the user's interest in the thread contents as the stimulus to contribute content to the thread. The semantic interest of users in discussion threads is the external input for the ELCA, i.e., the external value assigned to each choice. We demonstrate the approach on a dataset extracted from a real-life web forum devoted to fans of tinkering with musical instruments and related devices. The neuro-semantic model achieves high performance in predicting content posting decisions (average F score 0.61), improving greatly over well-known machine learning approaches, namely random forest and support vector machines (average F scores 0.19 and 0.21).


Introduction
There is a huge amount of research literature dealing with diverse aspects of the analysis of online social networks (OSN). Classical research efforts are devoted to identifying communities within the network [14,39,64], finding influencers or key members of the virtual community [6,10,29,31,34,47,63,72], or describing the evolution of specific networks [25,54,62]. There is, however, very little or no work on the actual decision process leading a user to publish some content in the OSN, e.g., posting a message in a forum of a virtual community of practice (VCoP). A VCoP implemented as a web-based forum is a virtual place where members interact, discuss ideas, share, and generate knowledge about specific topics organized into sub-forums and discussion threads. Content generation is a radically different process from the propagation effects across the OSN that follow the publication of some new content. For instance, publishing a tweet is radically different from retweeting, sharing, liking, or any other propagation process that spreads the influence of the original tweet content. Synthetic content generation, such as n-gram Markov models that generate fake tweets that humans find difficult to distinguish from real ones [66], is out of the scope of this paper.
The decision to contribute a post to a discussion thread of a VCoP is a phenomenon affected by multiple factors, such as the user's knowledge of the subject, the user's preferences, the other users participating in the discussion, and even the quality of the information presented. This decision process can be modeled as the competition of several simultaneously ongoing threads to win the attention of the user, i.e., the user selects the winning thread for publishing a contribution. This competition is modeled by a neurophysiological model of choice, the leaky competing accumulator (LCA) [9,76,77], where the activity of the computational neurons is driven by a set of linear differential equations that accumulate inhibitory contributions from other neurons, excitatory contributions from input units, and fluctuations from an independent white noise source. LCA has been shown to account successfully for the reaction time distributions empirically observed in psychophysical experiments. Specifically, for some combinations of inhibition and decay parameter values, LCA has been shown to reproduce the empirically observed violations of expected value and preference reversals reported in many experiments on value-based preferential choice. These studies focus on the distribution of the decision time for a fixed error ratio after many repetitions of the LCA run, trying to mimic the distributions found empirically. LCA parameters are hand tuned (or explored in a grid search) in order to find the values that reproduce the desired response time behavior and the expected choice error ratio, understood as choosing the lowest value option. Our work is more akin to machine learning approaches to modeling the decision process, i.e., we use LCA as a decision-making model whose performance is measured by the accuracy in predicting the decisions made by users to post a content contribution to a specific conversation thread, where the semantic value assigned to the conversation thread is treated as a constant input.
For our specific work, we propose an LCA model extended in several aspects (ELCA). First, the model includes many simultaneous choices by many users, while classical LCA considers a single agent and a small number of choices. Secondly, we use the semantic modeling of users and threads to compose the input value of each choice, thus linking the abstract valuation of the choices to concrete domain-related evidence. Thirdly, we implement a genetic algorithm search for the ELCA model parameter calibration (aka training) using data from the content contribution decisions in a real-life VCoP. The recovery of LCA parameters, stated as the induction of model parameters from simulated accumulator trajectories, has been acknowledged as an open and difficult problem [49], which has been tackled by exploiting Lie symmetries in a modified formulation of the LCA equations [45]. In contrast to these approaches, we look for the optimal ELCA parameters that reproduce the actual user decisions after convergence of the simulation. However, our work does not try to study or reproduce human choice phenomena, such as preference reversal, that are the original domain of study of the LCA model [9,76,77].
Semantic analysis of content published in OSNs is currently a hot research area that makes it possible to detect and prevent undesirable uses of the OSN. For instance, semantic analysis at the word level has been reported to detect cyberbullying [30], to help detect drunken tweets [24], and to estimate the age of users [56]. Also, content analysis of social media posts makes it possible to predict depression levels [2]. Specifically, we use unsupervised latent Dirichlet allocation (LDA) [8] topic analysis for the semantic modeling of the content published in the OSN, which allows us to build quantitative vectorial semantic representations of both users and conversation threads, not unlike the social semantics neurobiological model based on conceptual knowledge [7]. LDA is a powerful tool that has been used to summarize and build network models of contents, such as semantic graphs relating publications about COVID-19 [1].
Paper contributions and contents
This paper proposes a neuro-semantic model of the decisions made by the users to contribute contents to a VCoP web forum at the microscopic level. Specific contributions of this work are:
• The semantic characterization of the messages posted in the VCoP web forum is extracted by unsupervised formal topic analysis, namely LDA, allowing the semantic modeling of both users and conversation threads, so that user interest in generating content for a conversation thread can be quantified and assigned as an input value for the neurophysiological model of choice making, namely LCA.
• Ancillary information identifying key members of the social network, provided by the OSN administrators, is used for the stratification of users, improving the detail of the model of the content generation decision process.
• An extended LCA neurophysiological model of the user's individual decision process to generate and contribute content to the OSN that uses as input the semantic characterization of the users and the conversation threads. LCA is extended in three ways: (1) the use of semantically grounded values of the various choices, (2) the consideration of many choices and decision agents in a concurrent dynamic process, and (3) the estimation of the model parameters by maximizing prediction accuracy through a genetic algorithm search.
• Prediction accuracy is based on a representation of the user contributions as a bipartite graph where nodes are either users or conversation threads, and edges correspond to the publication of a post by a user in a thread. Prediction performance measures are based on the distance between the ground truth graph extracted from the dataset and the predicted graph, measured in terms of shared edges.
The paper is organized as follows: Sect. 2 presents related works on OSN information diffusion. Section 3 describes the materials and methods, including the description of the dataset, the semantic modeling, and the proposed neurosemantic model for user content publication decisions. Section 4 reports the details and results of the computational experiments conducted. Finally, Sect. 5 gives our conclusions and future work directions.

Related works
A great deal of the literature on OSN dynamic analysis has focused on the propagation of information across the network and the detection of communities and key influencer users. Table 1 gives a non-exhaustive summary of works found in the literature since 2007. There are two main research lines on models of information diffusion in networks [42], namely the explanatory and the predictive models. The first line of research includes modeling inspired by epidemics, while the second includes propagation models such as the cascade [20] or the linear threshold models [23]. This research is of utmost importance to areas like marketing, advertising, epidemiology, and social media analysis [79]. Some approaches to information spread modeling rely only on graph theory results [3,71] assuming complete knowledge of the network, but they do not report empirical validation over real data; some are purely speculative [27,35,52,59,69,74,81]. Aggregated predictions of the macroscopic or mesoscopic behavior of information diffusion have also been proposed [18,26,78,79,80]. For example, modeling the spread of information as epidemic propagation predicts the number of users that belong to the infected class [78,79,80] instead of trying to predict individual infections. Other works model the density function of the distribution of influenced users [26], the node influence derived from the network topological properties [18], or the macroscopic information dissemination as the propagation of a signal over the network where interference between events is modeled by signal convolution [58]. At the microscopic level, learning the payoff of the social agents' decisions from data allows accurate prediction of information diffusion [40]. Machine learning predictors of Twitter activity have been developed [55]; however, data is not always available for confirmation of results.
The role of topicality in Twitter adoption has been considered via machine learning predictive models [22] where topics correspond to selected hashtags, finding that topicality plays a major role in microscopic information propagation. Hashtag topics are also used in the construction of the similarity measure underlying a radiation transfer model for influence prediction [5], but their role is not isolated. On the other hand, the semantic modeling of the information content published in the OSN is gaining attention. For instance, semantic analysis of the social networks Weibo and Twitter based on single-word topics has been applied to study the public perception of vaccines against COVID-19 [46]. It has been shown that semantic modeling of user contents allows for improved community detection [28,82]. The impact of specific events on social media can be assessed using semantic modeling. For instance, an approximate model [17] is shown to detect events in social media, while event summarization on the basis of tweets can be achieved by a deep learning architecture [21]. Specifically, topic analysis by LDA has been used to uncover the meaning of events in social media [44] and the evolution of contents in social media [15]. Notably, sentiment analysis has been proposed to predict song contest results [16]. For recommender systems, an LDA-based hybrid topic recommender system has been proposed [33], and semantic analysis for recommendations has also been used in learning environments [32]. Moreover, semantic modeling of user interactions with a chatbot allows for personalized interactions [43]. Semantic analysis may be extended to the time domain, making it possible to measure changes in contents dynamically. Topic dynamics was applied to track the emergence of influential tweets about the Fukushima disaster [53] over a long period of time. The joint consideration of time and content made it possible to monitor changes in a VCoP where users exchange information about cosmetics [67].

Computational pipeline
The computational pipeline of this paper is shown in Fig. 1. It encompasses 5 phases corresponding to the numbered boxes in the figure (going from left to right):
(1) Data Mining Process: in this phase we carry out the curation and preprocessing of the raw OSN data described in Sect. 3.2. Section 3.3 describes data curation and preprocessing. Moreover, we build a characterization of each forum contribution by unsupervised LDA semantic topic analysis. Section 3.4 gives a short overview of LDA.
(2) Expert Training data Labeling (ETL): in this phase we prepare the user categorization using information from experts (i.e., the network administrators) as described in Sect. 3.2. This categorization modulates some of the LCA parameters as discussed below.
(3) Neurophysiological Model Setup: in this phase we formulate the LCA neural model that simulates the process of decision making for a content contribution published in some thread of a sub-forum. Our extended LCA (ELCA) is described in Sect. 3.6. From the LDA semantic model we construct the value of each conversation thread for each relevant user, which will be the input for the ELCA contribution decision prediction. This construction is described in Sect. 3.5.
(4) Parameter Calibration: we set up the genetic algorithm optimization to find the best parameter values of the neural model. The objective function is defined as the predictive performance over a subset of the dataset selected for model calibration. The genetic algorithm searches for the optimal settings of the LCA parameters using the data reserved for training. The genetic algorithm is described in Sect. 3.7.
An algorithmic description of the prediction of posts using the ELCA model is given in Algorithm 1, where the optimal values of the parameters $\hat{\beta}_c$, $\hat{\kappa}_c$, and $\hat{\lambda}_c$ have already been estimated by the genetic algorithm described in Algorithm 2.

Experimental dataset
The experimental work reported in this article is carried out over data extracted from a web-based forum called Plexilandia, which was implemented as an OSN with more than 2500 members. Plexilandia supports a Virtual Community of Practice (VCoP) [6,14,62,63,65] specifically devoted to tinkering with musical apparatus that has been running for over 15 years. We have access to data from its period of greatest activity, spanning 9 years. Table 2 contains the number of content publications per sub-forum over these 9 years, including the total number of posts. From now on, we use the word "post" to mean a content contribution to a sub-forum. The topics treated within Plexilandia's forum are arranged into sub-forums according to the interests of the VCoP members that frequent it. Table 2 identifies the following sub-forums: Amplifiers, Effects, Luthiers, General, Audio for professionals, and Synthesizers. Contents published in these sub-forums should be strictly related to the purpose of the community, although spurious topics may emerge from unrestricted user interaction. The hierarchical structure of the forum and its sub-forums is illustrated in Fig. 2.
Content contributions of users are made inside conversations that we denote as threads. A thread begins with a message posted by a user, containing a question or the presentation of an idea for discussion. Then, the different members of the community post their contributions, thus increasing the shared knowledge about the central theme of the conversation. Each publication in the thread is composed of elements such as the user identifier (ID); the content contribution, which depending on the forum can be text, images, links to other pages, or videos; and the management information of the forum system, such as the publication creation date, the thread, and the topic it belongs to. All these elements might be taken into consideration, but in this paper only the text content of posts is exploited to build and analyze the social network.

Experimental training and validation data setup
According to the content structure of the Plexilandia web forum, the dataset is partitioned into sub-forums. For the computational experiments, five sub-forums are considered. After examination of the distribution of the number of posts for different sizes of time periods (1 week, 2 weeks, 1 month, 2 months, 4 months) and the behavior of the threads during that time, a time period of 1 month has been selected, therefore aggregating the data into 13 time periods. The number of active users, active threads, and posts made during each of these 13 monthly time periods for each of the sub-forums is shown in Table 3. We provide an approximate imbalance ratio (IBR) for each sub-forum, computed as the number of possible content contributions, i.e., the number of active users times the number of active threads, divided by the number of actual posts. Figure 3 shows the data partition for the validation experiments, using the data from the first month of 2013 (January) for the ELCA model calibration and the remaining months for testing. In other words, 8% of the data is used for the estimation of the optimal ELCA parameters by a genetic algorithm, and 92% for testing. Thus, model validation is set in a framework of training data scarcity, which is more realistic than training data abundance (such as when using 70% for training and 30% for testing) when trying to predict the online evolution of an OSN.
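The imbalance ratio and the size of the calibration split described above can be illustrated with a minimal sketch. The function name `imbalance_ratio` and the counts used here are hypothetical and are not taken from Table 3.

```python
def imbalance_ratio(active_users: int, active_threads: int, posts: int) -> float:
    """Approximate IBR: possible (user, thread) contributions per actual post."""
    if posts <= 0:
        raise ValueError("posts must be positive")
    return (active_users * active_threads) / posts


# Hypothetical counts for one sub-forum in one month (not from Table 3).
ibr = imbalance_ratio(active_users=50, active_threads=40, posts=200)
print(ibr)  # 10.0

# One calibration month out of 13 monthly periods.
train_fraction = 1 / 13
print(round(100 * train_fraction))  # 8
```

A large IBR means that the absent (user, thread) pairs vastly outnumber the observed posts, which is the class imbalance the evaluation has to cope with.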

Categories of users
The OSN administrators provided a stratification of members for the year 2013 into four user categories [63] according to the role that they play in keeping the forum alive:
• Experts Type A: the most important key-members, who create and sustain meaningful threads in relevant sub-forums. There are 34 such members based on the administrators' criteria.
• Experts Type B: members who are also very important but to a lesser degree than A-type key-members.
• Non-experts or Type X: this class contains all members of the social network who are not key-members. They do not belong to the social network core and they usually ask questions rather than publish answers or tutorials.
We use only the data for the years 2013 and 2014 because we only have the information regarding key-members for these years [63]. We use the data of sub-forums 2 to 6, discarding sub-forums 1 and 7 because they do not have enough posts to contribute to the analysis.

Data curation and preprocessing
The first step in our computational pipeline is the curation and preprocessing of the Plexilandia data [75]. First, we filter out the quotes from previous content contributions posted in the thread. A user can respond to a post by creating a new content contribution that includes a copy of the cited post plus the additional text of the new contribution. Therefore, it is necessary to delete the replicated part of the new post, retaining only the new text input. Next, we transform the acronyms or abbreviations, eliminate spelling errors, and remove all elements of the posts that make them not comparable. This process relies on two natural language processing techniques: stemming and stop-word removal. This serves to make posts comparable and to reduce the number of words used to compute post comparison. We then apply the LDA unsupervised topic modeling described in the next section for the semantic modeling of the content of documents [61].
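The curation steps above (quote filtering, stop-word removal, stemming) can be sketched as follows. This is only an illustration under stated assumptions: the `[quote]...[/quote]` markup, the tiny stop-word list, and the toy suffix stripper `naive_stem` are hypothetical stand-ins, since the actual forum markup and the tools used are not specified in the text.

```python
import re

# Hypothetical, tiny stop-word list; a real pipeline would use a complete
# list for the language of the forum.
STOP_WORDS = {"the", "a", "an", "of", "to", "is", "and", "in", "my"}

# Assumed quote markup; the actual forum markup is not given in the text.
QUOTE_RE = re.compile(r"\[quote\].*?\[/quote\]", re.IGNORECASE | re.DOTALL)


def strip_quotes(post: str) -> str:
    """Remove the replicated (quoted) part of a post, keeping the new text."""
    return QUOTE_RE.sub(" ", post)


def naive_stem(word: str) -> str:
    """Toy suffix stripper standing in for a real stemmer."""
    for suffix in ("ing", "ers", "er", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(post: str) -> list[str]:
    """Quote removal, lowercasing, tokenization, stop-word removal, stemming."""
    text = strip_quotes(post).lower()
    tokens = re.findall(r"[a-z]+", text)  # ASCII-only tokens, a simplification
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


raw = "[quote]Which tubes?[/quote] Replacing the tubes fixed my amplifiers"
print(preprocess(raw))  # ['replac', 'tub', 'fixed', 'amplifi']
```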

LDA topic analysis for semantic modeling
In this section, we give a brief account of the latent Dirichlet allocation (LDA) topic analysis used for semantic modeling. Let $V$ be a vector of size $|V|$ in which every row represents a different word used in the network, i.e., the vocabulary. Let $v_i$ be the word in place $i$ of vector $V$. It is possible to represent post $p_j$ as a sequence of $S_j$ words out of $V$, with $S_j = |p_j|$, where $j \in \{1, \ldots, |P|\}$ and $|P|$ corresponds to the number of posts that have been published in the VCoP forum. A corpus is defined as a collection of posts $C = \{p_1, \ldots, p_N\}$. We can define the matrix $W$ of size $|V| \times |P|$ where each element $w_{i,j}$ of this matrix is defined as the number of times the word $v_i$ appears in post $p_j$. Then $\sum_{i=1}^{|V|} w_{i,j} = S_j$. Likewise, we can define $\sum_{j=1}^{|P|} w_{i,j} = T_i$, which represents the total number of appearances of the term $v_i$ in the corpus.
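As a small illustration of these definitions, the following sketch builds the count matrix $W$ for two toy posts and checks the column sums $S_j$ and row sums $T_i$. The posts and the helper name `count_matrix` are hypothetical.

```python
from collections import Counter


def count_matrix(posts, vocabulary):
    """Return W as a list of rows: W[i][j] = count of vocabulary[i] in posts[j]."""
    counts = [Counter(p) for p in posts]
    return [[c[word] for c in counts] for word in vocabulary]


# Two hypothetical, already preprocessed posts.
posts = [["amp", "tube", "tube"], ["pedal", "amp"]]
vocabulary = sorted({w for p in posts for w in p})  # ['amp', 'pedal', 'tube']
W = count_matrix(posts, vocabulary)

# Column sums give the post lengths S_j; row sums give the term totals T_i.
S = [sum(W[i][j] for i in range(len(vocabulary))) for j in range(len(posts))]
T = [sum(row) for row in W]

print(W)  # [[1, 1], [0, 1], [2, 0]]
print(S)  # [3, 2]
print(T)  # [2, 1, 2]
```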
A corpus can be represented by the product of the term frequency and the inverse document frequency (TF-IDF) matrix $M$ of size $|V| \times |P|$ [68], where each entry $m_{i,j}$ of the matrix is determined by Eq. (1), in which $n_i$ is the number of posts including the word $v_i$ and $T_i$ is the maximum number of appearances of word $v_i$ in any post. The IDF term presented in Eq. (1) contains a correction with respect to the original IDF term $\log\left[\frac{|P|}{n_i}\right]$ to avoid undefined results when a post does not contain words after data curation. For dimension reduction, we employ an unsupervised topic discovery technique, namely LDA [4,8], using the Gibbs sampling implementation [57]. This implementation does not search for the optimal values of the hyper-parameters $\alpha$, $\beta$, and the number of required topics $|T| = k$, so we have to make an empirical exploration to find them. LDA provides us with the distribution of each word over the discovered topics, the distribution of topics over the posts, and the $n$ most important words that represent each topic together with their membership probabilities. In order to have fixed-size probability vectors of size $|V|$ representing each topic, we pad them with zeros. These vectors are the columns of the semantic matrix (SM) of size [Terms × Topics]. In order to obtain the semantic description of the posts in a matrix of size [Posts × Topics], we multiply $M^{t}$, the transpose of the TF-IDF matrix defined by Eq. (1), by the SM. The resulting [Posts × Topics] matrix contains the semantic explanation of each post as a linear combination of the discovered topics via their vector semantic representations given by the rows of the matrix, denoted $\{\rho_p;\ p \in P\}$.

Fig. 3 Experimental setup of data exploitation for model validation. Red dots correspond to months with missing data. Blue dots correspond to months whose data is used for training. Green dots correspond to months whose data is used for testing (color figure online)

Neural Computing and Applications (2022) 34:16717-16738
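The projection of posts into the topic space described above amounts to a single matrix product. The sketch below uses plain Python and small hypothetical matrices; the numerical values of `M` and `SM` are made up for illustration, and the TF-IDF weighting of Eq. (1) is not reproduced here.

```python
def transpose(matrix):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*matrix)]


def matmul(a, b):
    """Plain-Python product of a (n x m) and b (m x k)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# Hypothetical TF-IDF matrix M: 3 terms x 2 posts.
M = [[0.5, 0.0],
     [0.0, 1.0],
     [1.0, 0.5]]

# Hypothetical semantic matrix SM: 3 terms x 2 topics.
SM = [[0.6, 0.1],
      [0.1, 0.7],
      [0.3, 0.2]]

# Row p of the result is the topic-space vector rho_p of post p.
post_topics = matmul(transpose(M), SM)  # 2 posts x 2 topics
for rho in post_topics:
    print([round(v, 3) for v in rho])
```

Each row of `post_topics` plays the role of one vector $\rho_p$ in the remainder of the pipeline.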

From semantic modeling to valuation
Let us denote by $U$, $TH$, and $SF$ the set of users, the set of threads, and the set of sub-forums in the virtual community, respectively. The results of the LDA semantic analysis, namely the vectors $\rho_p$, allow us to induce a multi-topic preference vector representation for each user $u \in U$ and a semantic content vector representation for each thread $h \in TH$. The process to compute these semantic representations is as follows:
1. We aggregate the users' content contributions according to the sub-forum $f \in SF$ where they are posted.
2. We discretize the time axis into periods of size $\Delta t$, thus creating a set of time periods $T$. Subsequently, we aggregate the content contributions from each sub-forum according to the time period $t \in T$ they belong to.
3. We extract the users ($U^t_f$) and threads ($TH^t_f$) that are active during each time period. A user $u$ is active in sub-forum $f$ and period $t$ if he makes a content contribution during this period. A thread $h$ in sub-forum $f$ is active if any user makes a content contribution to the thread during period $t$.
4. The thread semantic content vector representation for a period, denoted $\nu^t_h$, is the mean of the semantic vector representations $\rho_p$ of the content contributions that belong to both the thread $h$ and the period $t$, formally:
$$\nu^t_h = \frac{1}{|P(h,t)|} \sum_{p \in P(h,t)} \rho_p,$$
where $P(h,t) = \{p \in P : p \text{ is posted in thread } h \text{ during period } t\}$.
5. To compute the user semantic representation, we categorize into subgroups, denoted $s$, the content contributions made by a user during a period according to the thread they were posted in. A user will have as many semantic vector representations for a period as threads that he has contributed to during this period. We denote the collection of these vector representations as $S^t_u$.
6. A user semantic vector representation for a period $t$ and subgroup of content contributions $s$, denoted $\mu^t_{u,s}$, is the mean of the semantic vector representations $\rho_p$ of the content contributions made by the user $u$ in this period of time, formally:
$$\mu^t_{u,s} = \frac{1}{|P(u,s,t)|} \sum_{p \in P(u,s,t)} \rho_p,$$
where $P(u,s,t) = \{p \in P : p \text{ is posted by user } u \text{ in period } t \text{ and belongs to subgroup } s\}$.
Now that we have the multi-topic semantic vector representation of the users and the semantic representation of the threads, we apply the computational pipeline shown in Fig. 4 to obtain the input for the extended LCA that implements the content contribution decision model.
1. First, we select a measure of the similarity $\chi$ of two semantic vector representations in the topic space. We use the cosine similarity, given by the cosine of the angle formed between two semantic vector representations. Thus, for a user multi-topic preference vector representation $\mu^t_{u,s}$ and a thread semantic content vector representation $\nu^t_h$, the similarity between them is given by
$$\chi(\mu^t_{u,s}, \nu^t_h) = \cos\theta = \frac{\mu^t_{u,s} \cdot \nu^t_h}{\|\mu^t_{u,s}\| \, \|\nu^t_h\|},$$
where $\theta$ is the angle between $\mu^t_{u,s}$ and $\nu^t_h$.
2. Then, we define a function $\Psi_1$ mapping semantic similarity into user utility. The utility that a user extracts from a thread is the expected number of times he chooses the thread over other threads to make a content contribution. Consider that $p = 1 - \chi(\mu^t_{u,s}, \nu^t_h)$ is the success probability parameter of a geometric distribution. The utility $\Psi_1$ of the similarity between user and thread semantic representations is defined from this distribution following [11]. Furthermore, the preference of a user for a thread, i.e., the normalized user utility of a thread $h$, denoted $V^t_{u,s,h}$, takes into account all the threads in the sub-forum and is computed by a function $\Psi_2$, in which the parameter $\alpha$ modulates the preference of the users for threads whose topics are similar to the topics covered by the user content contributions. The greater the preference, the greater the satisfaction extracted from the conversation. Figure 5 plots an example of the utility values that a user attributes to the threads that are active at some period in time. Notice that only a few threads are of great interest to the user. Most active threads are stacked at the tail of the plot, meaning that they mostly contribute noise to the decision process. Therefore, we reduce the number of alternative threads that a user takes into account during his decision-making process to generate content, keeping only the $m$ threads with top utility values. This reduction of alternatives is based on classic research results about working memory and attention span [50].
3. Finally, we define a function $\Omega$ that maps the normalized user utility of each thread into the LCA input associated with the decision to make a content contribution to the thread, denoted $I^t_{u,s,h}$. For this purpose, we make use of random utility theory [11]. In the resulting expression, Eq. (8), $\beta^{(c(u))}$ is a proportionality parameter of the model that is specific for the category $c(u)$ of the user (defined as A, B, C, or X in Sect. 3.2), and $TH^t_f(u, m) = \{h \in TH^t_f : \text{the utility of } h \text{ is one of the top } m \text{ for user } u\}$.
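The aggregation and similarity steps of this section can be sketched in a few lines. The example below computes mean topic vectors for threads and for one user subgroup, their cosine similarity, and the top-$m$ threads. It ranks threads directly by similarity as a simplifying assumption; the utility mappings $\Psi_1$, $\Psi_2$, and $\Omega$ are not reproduced. All vectors and identifiers are hypothetical.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of equally sized topic vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def top_m_threads(user_vec, thread_vecs, m):
    """Return the ids of the m threads most similar to the user vector."""
    scored = {h: cosine_similarity(user_vec, v) for h, v in thread_vecs.items()}
    return sorted(scored, key=scored.get, reverse=True)[:m]


# Hypothetical post vectors rho_p grouped by thread for one sub-forum and period.
thread_posts = {
    "h1": [[1.0, 0.0], [0.8, 0.2]],
    "h2": [[0.0, 1.0]],
    "h3": [[0.5, 0.5]],
}
thread_vecs = {h: mean_vector(ps) for h, ps in thread_posts.items()}

# Hypothetical posts of one user in one subgroup during the same period.
user_vec = mean_vector([[0.9, 0.1], [0.7, 0.3]])

print(top_m_threads(user_vec, thread_vecs, m=2))  # ['h1', 'h3']
```

Ranking by similarity gives the same ordering as ranking by any utility that increases monotonically with similarity, which is why the truncation step can be illustrated without the exact utility functions.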

Extended leaky competing accumulator (ELCA)
The decision process leading to the contribution of posts to conversation threads is modeled by an extended leaky competing accumulator (ELCA). The original LCA [9,65,76,77] considered only a decision carried out by a single agent, while our ELCA carries out the decision processes of many users simultaneously, i.e., ELCA extends LCA over a community of users making decisions simultaneously. We consider independent processes for each sub-forum $f$ and each time period $t$. We denote by $X^{(u)}_h$ the (neural) activation associated with the decision by user $u \in U^t_f$ to publish a post in thread $h \in TH^t_f$. The decision process is implemented as a dynamic process where the activation units evolve until one of them reaches a given threshold that triggers the corresponding decision. The evolution of the activation units for a user is illustrated in Fig. 6. Moreover, our ELCA has semantically grounded values associated with each choice, the term $I^t_{u,s,h}$ defined in Eq. (8), while classical LCA models have arbitrary values tuned by the researcher's intuition. Finally, we provide a procedure to estimate the optimal ELCA parameters that reproduce the actual decisions made by the users, in a way similar to the training of conventional machine learning approaches.
The ELCA model describes the evolution of the joint decision process of all users as the simulation of a set of dynamic stochastic equations in which the $\kappa_c$ parameter models the activation decay of each unit [48] and lateral inhibition between accumulator units is modeled by the $\lambda_c$ parameter. Equation (10) imposes a hard limit on the unit activations. This hard limit has some interesting computational properties [9]. This model is in accordance with perceptual decision making [19]. Initial conditions $X^{(u)}_h(s = 0)$ are specified by Eq. (11). Parameter $l$ in Eq. (11) denotes the number of times thread alternative $h$ has been chosen previously, and parameter $\gamma \geq 0$ models the effect of repeated choices of the same alternative approaching the asymptotic curve defined in [38]. Recent works have shown convergence to a decision for a large number of choices in a modified LCA model [45], but their model is limited to a single agent. They show that it is possible to recover the model parameters by a maximum likelihood approach; however, they refer to the reproduction of simulation traces, while in the next section we deal with parameter estimation to approximate the user decision behavior extracted from the real OSN data.
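To make the roles of decay, lateral inhibition, noise, and the decision threshold concrete, the following is a minimal sketch of a classical single-agent LCA simulated with an Euler scheme. It is not the ELCA formulation of this paper: the update rule, the hard lower limit at zero, the zero initial activations, and the names `simulate_lca`, `kappa`, `lam`, `threshold`, `dt`, and `noise_sd` are assumptions made for illustration.

```python
import random


def simulate_lca(inputs, kappa, lam, threshold=1.0, dt=0.1,
                 noise_sd=0.0, max_steps=10_000, seed=0):
    """Euler simulation of a classical LCA.

    Returns (index of the winning unit or None, final activations).
    """
    rng = random.Random(seed)
    x = [0.0] * len(inputs)  # assumed zero initial activations
    for _ in range(max_steps):
        total = sum(x)
        new_x = []
        for i, inp in enumerate(inputs):
            inhibition = lam * (total - x[i])        # lateral inhibition from other units
            drift = inp - kappa * x[i] - inhibition  # input minus leak minus inhibition
            noise = rng.gauss(0.0, noise_sd) * dt ** 0.5
            new_x.append(max(0.0, x[i] + drift * dt + noise))  # hard lower limit at 0
        x = new_x
        best = max(range(len(x)), key=x.__getitem__)
        if x[best] >= threshold:  # first unit to reach the threshold wins
            return best, x
    return None, x


# Three hypothetical thread inputs for one user; noise is switched off so
# the run is deterministic.
winner, activations = simulate_lca([1.0, 0.2, 0.1], kappa=0.1, lam=0.2)
print(winner)  # 0
```

In this sketch each unit stands for one candidate thread of a single user. A many-user extension would run one such competition per active user with category-specific parameters.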

ELCA parameter estimation by genetic algorithm
ELCA parameter estimation was implemented by a genetic algorithm (GA) [73], illustrated in Fig. 7, with the following settings. Each individual $P_g \in P$ in the GA population is composed of 12 real-valued genes, which are estimations of the parameters of the LCA model for each kind of user in the sub-forum, i.e., $P_g = \{\hat{\beta}_c, \hat{\kappa}_c, \hat{\lambda}_c;\ c \in \{A, B, C, X\}\}$. The size of the population was 100 individuals. The initial values of the parameters composing each individual were generated following a uniform distribution in the [0, 1] interval. The fitness function is the accuracy of content contribution prediction by the LCA model using the individual's parameter settings over the first month of the dataset. In other words, in order to compute the fitness of each individual in the population, we run an instance of the LCA simulation and compare its record of post publication decisions to the data from the first month. The selection of individuals for crossover is carried out by Baker's linear-ranking algorithm [70] and roulette wheel selection [36]. Reproductive crossover was implemented by a single-point crossover algorithm [60]. The mutation operator was a real-valued mutation [51]. Independent GA searches were carried out for each sub-forum. The details of the implementation, such as population size, number of generations computed, and the implementation of elitist selection policies, are specified in Algorithm 2.
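The GA settings listed above can be illustrated with a compact sketch. It combines rank-weighted roulette selection, single-point crossover, a real-valued mutation, and elitism over 12 genes in [0, 1]. It is not Algorithm 2: the placeholder objective `toy_fitness`, the Gaussian form of the mutation, and the values of `generations`, `n_elite`, `rate`, and `sd` are assumptions made for illustration. In the paper the fitness is the prediction accuracy of the LCA simulation on the first month.

```python
import random

N_GENES = 12  # beta, kappa, lambda for each of the categories A, B, C, X


def toy_fitness(ind):
    """Placeholder objective (assumption); not the paper's fitness function."""
    return -sum((g - 0.5) ** 2 for g in ind)


def linear_rank_select(ranked, rng):
    """Roulette wheel over linear rank weights; ranked[0] is the best."""
    n = len(ranked)
    weights = [n - i for i in range(n)]
    return rng.choices(ranked, weights=weights, k=1)[0]


def single_point_crossover(a, b, rng):
    """Child takes the head of a and the tail of b."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def mutate(ind, rng, rate=0.1, sd=0.1):
    """Gaussian perturbation of some genes, clipped to [0, 1] (assumed form)."""
    return [min(1.0, max(0.0, g + rng.gauss(0.0, sd))) if rng.random() < rate else g
            for g in ind]


def run_ga(fitness, pop_size=100, generations=30, n_elite=2, seed=0):
    rng = random.Random(seed)
    # Uniform initialization in [0, 1], as stated in the text.
    pop = [[rng.random() for _ in range(N_GENES)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        ranked = sorted(pop, key=fitness, reverse=True)
        history.append(fitness(ranked[0]))
        children = [list(ind) for ind in ranked[:n_elite]]  # elitism
        while len(children) < pop_size:
            p1 = linear_rank_select(ranked, rng)
            p2 = linear_rank_select(ranked, rng)
            children.append(mutate(single_point_crossover(p1, p2, rng), rng))
        pop = children
    return max(pop, key=fitness), history


best, history = run_ga(toy_fitness)
print(len(best), round(history[0], 3), round(history[-1], 3))
```

Because the elite individuals are copied unchanged, the best fitness recorded in `history` cannot decrease from one generation to the next.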

Performance measures
As specified in Algorithm 1, the result of the ELCA simulation is a set of user-thread pairs that are interpreted as predictors of the actual pairs that can be extracted from the ground truth post publications. We make independent predictions for each time period and sub-forum. These pairs can be visualized as the edges of bipartite graphs, namely the predicted and the ground truth publication graphs. We define true positives as the edges that are in both graphs, true negatives as the edges that are absent from both graphs, false positives as the edges that appear in the prediction but are absent in the ground truth, and false negatives as the edges that are absent in the prediction but appear in the ground truth.
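In order to evaluate the quality of the ELCA predictions, we compute four performance measures combining these basic counts, namely Recall, Accuracy, Precision, and the F measure. Let TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively. Recall is the ratio of true positives over the actual edges in the provided ground truth data:
$$\mathrm{Recall} = \frac{TP}{TP + FN}.$$
Accuracy is the ratio of correctly predicted edges and non-edges over all possible user-thread pairs, and Precision is the ratio of true positives over the predicted edges:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{Precision} = \frac{TP}{TP + FP}.$$
The F measure (aka F1 score) combines precision and recall, measuring the balance between them. It is defined as:
$$F = 2\,\frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$$
Notice that, in our case study, the number of negative edges is much greater than the number of positive edges. Hence the accuracy is dominated by the prediction of negative edges, i.e. the absence of positive edge prediction, and it can be high even if many actual edges are missed. For this reason, we focus the report of results on the F measure, which is a more reliable measure under high class imbalance.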
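The four measures can be computed directly from the predicted and ground truth edge sets. The sketch below assumes that edges are stored as (user, thread) tuples and that every possible pair of the given users and threads counts as a candidate edge; the function name `edge_metrics` and the toy identifiers in the usage note are hypothetical.

```python
from itertools import product


def edge_metrics(predicted, actual, users, threads):
    """Recall, accuracy, precision and F measure for predicted bipartite edges."""
    predicted, actual = set(predicted), set(actual)
    all_pairs = set(product(users, threads))
    tp = len(predicted & actual)              # edges in both graphs
    fp = len(predicted - actual)              # predicted but not in ground truth
    fn = len(actual - predicted)              # in ground truth but not predicted
    tn = len(all_pairs - predicted - actual)  # absent from both graphs
    total = tp + tn + fp + fn
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"recall": recall, "accuracy": accuracy,
            "precision": precision, "f1": f1}
```

For a toy case with 3 users and 4 threads, two ground truth edges, and two predicted edges of which one is correct, recall, precision, and F are all 0.5 while accuracy is 10/12, which illustrates how the many absent edges inflate accuracy.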

Experimental results
As described in Fig. 3, for each sub-forum we carry out an independent GA search to obtain the optimal parameters for the ELCA model over the data from month 1. The optimal ELCA parameter values obtained by the GA search for each sub-forum are specified in Table 4. The ELCA model with these parameter settings is used to predict the generation of posts from users on specific threads for each sub-forum and for each month between February 2013 and January 2014. The average prediction performance results of the ELCA approach are given in Table 5. In Table 6, we present the detailed results in terms of the F-measure for each sub-forum and for each month considered within the time frame. The overall mean F-measure score of ELCA across all sub-forum experiments is 0.61.

Comparison with machine learning approaches
For comparison, we have trained conventional machine learning approaches. The training dataset is extracted from the same period (first month) used to calibrate the ELCA model. For each possible pair of active user $u$ and thread $h$, we define the feature vector by concatenating the semantic descriptions of the user and the thread, $x_{u,h} = (\mu^t_{u,s}, \nu^t_h)$, and the class variable $y_{u,h} \in \{\text{existing}, \text{non-existing}\}$ that signals whether there is at least one post by user $u$ in thread $h$ in this time period. The testing [...] against Tables 7 and 8 confirms that the superiority of the ELCA model is highly significant ($p < 10^{-16}$).

Table 4 Optimal ELCA parameter values for each sub-forum found by independent GA searches over the training data (January 2013)
Bold values correspond to summary values, either total or first order statistics, mean, min and max values

User   Posts in
U1     T384, T413
U8     T372, T402
U9     T37, T367, T402, T413
U13    T365, T372
U14    T372, T414
U15    T369, T414
U30    T266, T369
U43    T372, T384
U46    T387
U67    T266, T384, T414, T419, T438
U111   T402
U127   T103, T367, T438
U132   T365, T367
U154   T266, T367, T372
U198   T365
U201   T266, T367, T384, T387, T414
U215   T196
U229   T37, T266, T413, T419
U233   T266
U245   T37, T196, T367, T413
U248   T103
U249   T369
User = U**, conversation threads the user has published posts in = T***
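The construction of the training set for the baseline classifiers, as described in the comparison with machine learning approaches, can be sketched as follows. This is a minimal illustration that assumes the user and thread semantic descriptions are available as plain topic-weight lists keyed by identifier; the function name `build_dataset` and the dictionary layout are hypothetical, and fitting the random forest or support vector machine is left out.

```python
def build_dataset(user_topics, thread_topics, posts):
    """Build one labelled example per (user, thread) pair.

    user_topics   : dict mapping user id -> list of topic weights
    thread_topics : dict mapping thread id -> list of topic weights
    posts         : set of (user id, thread id) pairs with at least one post
    Returns (X, y, pairs), aligned index by index.
    """
    X, y, pairs = [], [], []
    for u in sorted(user_topics):
        for h in sorted(thread_topics):
            # Feature vector: user description concatenated with thread description.
            X.append(list(user_topics[u]) + list(thread_topics[h]))
            # Class variable: does user u post in thread h in this period?
            y.append("existing" if (u, h) in posts else "non-existing")
            pairs.append((u, h))
    return X, y, pairs
```

Because every user-thread pair yields one example, the resulting dataset inherits the strong imbalance between existing and non-existing links.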

Discussion
For a qualitative appreciation of the results, Figs. 8 and 9 show the graph representations of the content publication predictions for sub-forum 4 at month 4 and sub-forum 6 at month 10, where violet and black nodes correspond to threads and users, respectively. Green edges correspond to the content contributions that the ELCA simulation predicted correctly, black edges are false positives, and brown edges correspond to false negatives. Tables 9 and 10 display the content publishing rules derived from the ELCA simulation. We can notice that most of the network edges are green and that there is approximately the same number of predicted edges and ground truth edges, which is a very important structural property we must comply with. There are few false positives compared to the large number of non-existing links. This is the reason for the high values of the accuracy performance measure in Table 5 relative to the other measures, which only take into account the true positives. We recall from Table 3 that our sub-forum datasets can be considered as very imbalanced two-class datasets if we aim to predict the links between users and threads. It is well known that most classifiers are biased towards the majority class (here the non-existing links). Under-sampling the majority class or over-sampling the minority class are proposed as means to improve the performance on the minority class; however, it is not clear how to carry out these procedures over our sub-forum data. We get the best results in terms of F measure for sub-forum 6. It seems that the lower number of posts allows a more efficient semantic analysis and makes it easier for the model to find the threads a user is interested in.

User   Posts in
U1     T46, T610, T840
U9     T840
U16    T610
U32    T703
U75    T788
U151   T788
U163   T840
U180   T703, T788
U207   T610
U228   T840
U229   T610
U237   T46
U241   T46
U257   T788
U279   T46, T610, T840
User = U**, conversation threads the user has published posts in = T***
A relevant observation is that as the number of posts increases in a sub-forum, the predictive results worsen. A qualitative interpretation is that it becomes harder to predict whether a user will post to a thread based on the semantic description of the content because it is contaminated with spurious unfiltered messages. In Fig. 10 we show the network graph corresponding to the month and sub-forum with the worst performance results. We notice a large number of false positives. This led us to investigate further, so in Fig. 11 we show the scatter plot of the number of posts made in a unit period of time (month) versus the F measure score achieved by the neuro-semantic model in the same period. It appears that as the number of posts increases, the prediction performance of the ELCA model decreases. As before, our interpretation is that the cause of this decrease is the increased heterogeneity of the semantic content in the thread, which becomes very noisy.
One way to enhance the neuro-semantic model is to incorporate a discrimination behavior for users that filters out posts that differ too much from the user semantic preference vector [41]. If we consider the temporal behavior of the F measure results within a sub-forum, the scores do not deviate much from the mean value, hence the LCA model is very robust in terms of temporal decay. We associate this behavior with parameter $\alpha$. In this research, we set $\alpha = 50$ without further search for an optimal setting. However, this parameter could also be optimized by the GA approach.

Conclusions
This paper presents a neuro-semantic model of the content publication decisions of users in a web forum OSN at the microscopic level, i.e. the model predicts the specific decision of a user to post a message in a specific conversation thread of a sub-forum. We propose an extended leaky competing accumulator (ELCA) neural model that implements the competition of the diverse threads for the attention of the user as a dynamical process. Model parameter estimation was carried out by a genetic algorithm optimization process. To our knowledge, this is the first work where LCA parameters are estimated from data obtained from a social network content generation prediction in order to achieve optimal predictive performance. The reviewed literature contains rough qualitative settings of the parameters in order to study the emergent behavior according to theories of value based choice. On the other hand, we have not detected any of the well known choice phenomena, such as preference reversals. A more detailed analysis might uncover such phenomena in our problem domain. The semantic similarity underlying the attention mechanism is modeled by unsupervised topic analysis, thus it is fully automated. Results over the data extracted from a real life OSN are quite promising. Specifically, the ELCA model improves greatly over standard machine learning approaches, namely random forest (RF) and support vector machines (SVM), using the same kind of semantic information as input features. The best and average F scores of ELCA were 0.95 and 0.61, respectively, while for RF and SVM the best F scores were 0.60 and 0.63, respectively, and the average F scores were 0.19 and 0.21, respectively. Fundamental research into likelihood maximization approaches to LCA parameter estimation is a priority for future work.
Further work will be directed to a deeper exploration into the fundamentals of Natural Language Processing (NLP) algorithms in order to improve the capture of the real meaning of the posted text documents, overcoming frequentist approaches to model the joint occurrence of words in a document [13]. Automatic ontology creation for a specific domain is a promising approach to tackle this problem. We will explore word embeddings as a very powerful modeling approach at the expense of interpretability.
Finally, another quite exciting research area is topic space metrics. Future work could address the definition of an adequate distance between multi-topic text vector representations allowing the extraction of the most valuable content generated by users. Besides, the approach developed in this work could be combined with other existing methods that capture topological features of the network, looking for an improvement in prediction performance by such a hybrid system.
Availability of data and material (data transparency) Data used for the computational experiments will be available in zenodo.org after paper acceptance for publication.
Code availability Specific code for the computational pipeline will be published with the data in zenodo.org after paper acceptance for publication.

Declarations
Conflict of interest The authors declare that they have no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.